Abstract

Data collection at a massive scale is becoming ubiquitous in a wide variety of settings, 
from vast offline databases to streaming real-time information. Learning algorithms 
deployed in such contexts must rely on single-pass inference, where the data history 
is never revisited. In streaming contexts, learning must also be temporally adaptive 
to remain up-to-date against unforeseen changes in the data generating mechanism. 
Although rapidly growing, the online Bayesian inference literature remains challenged by 
massive data and transient, evolving data streams. Non-parametric modelling techniques 
can prove particularly ill-suited, as the complexity of the model is allowed to increase 
with the sample size. In this work, we take steps to overcome these challenges by porting 
standard streaming techniques, like data discarding and downweighting, into a fully 
Bayesian framework via the use of informative priors and active learning heuristics. We 
showcase our methods by augmenting a modern non-parametric modelling framework, 
dynamic trees, and illustrate its performance on a number of practical examples. The 
end product is a powerful streaming regression and classification tool, whose performance 
compares favourably to the state-of-the-art. 



1 Introduction 

In online inference, the objective is to develop a set of update equations that incorporate novel 
information as it becomes available, without needing to revisit the data history. This results 
in model fitting algorithms whose space and time complexity remains constant as information 
accumulates, and can hence operate in streaming environments featuring continual data 
arrival, or navigate massive datasets sequentially. Such operational constraints are becoming 
imperative in certain application areas as the scale and real-time nature of modern data 
collection continues to grow. 

* Corresponding author. 
In certain simple cases, online estimation without information loss is possible via exact
recursive update formulae, e.g., via conjugate Bayesian updating (see Section 3.1). In
parametric dynamic modelling, approximate samples from the filtering distribution for a variable
of interest may be obtained online via sequential Monte Carlo (SMC) techniques, under quite
general conditions. In Taddy et al. (2011), SMC is used in a non-parametric context, where
a 'particle cloud' of dynamic trees is employed to track parsimonious regression and
classification surfaces as data arrive sequentially. However, the resulting algorithm is not, strictly
speaking, online, since tree moves may require access to the full data history, rather than
parametric summaries thereof. This complication arises as an essential by-product of
non-parametric modelling, wherein the complexity of the estimator is allowed to increase with
the dataset size. Therefore, this article recognises that maintaining constant operational cost
as new data arrive necessarily requires discarding some (e.g., historical) data.

Specifically, and to help set notation for the remainder of the paper, we consider
supervised learning problems with labelled data $(x_t, y_t)$, for $t = 1, 2, \dots, T$, where $T$ is
either very large or infinite. We consider responses $y_t$ which are real-valued (i.e., regression) or
categorical (classification). The $p$-dimensional predictors may include real-valued features, as well
as binary encodings of categorical ones. The dynamic tree model, reviewed shortly in
Section 2, allows sequential non-parametric learning via local adaptation when new data arrive.
However, its complexity, and thus computational time/space demands, may grow with the
data size $t$. The only effective way to limit these demands is to sacrifice degrees-of-freedom
(DoF) in representation of the fit, and the simplest way to do that is to discard data; that
is, to require the trees to work with a subset $w \ll t$ of the data seen so far.

Our primary concern in this paper is managing the information loss entailed in data
discarding. First, we propose datapoint retirement (Section 3), whereby discarded datapoints
are partially 'remembered' through conjugate informative priors, updated sequentially. This
technique is well-suited to trees, which combine non-parametric flexibility with simple
parametric models and conjugate priors. Nevertheless, forming new partitions in the tree still
requires access to actual datapoints, and consequently data discarding comes at a cost of
both information and flexibility. We show that these costs can be managed, to a surprising
extent, by employing the right retirement scheme even when discarding data randomly. In
Section 4, we further show that borrowing active learning heuristics to prioritise points for
retirement, i.e., active discarding, leads to better performance still.

An orthogonal concern in streaming data contexts is the need for temporal adaptivity
when the concept being learned exhibits drift. This is where the data generating mechanism
evolves over time in an unknown way. Recursive update formulae will generally require
modification to acquire temporally adaptive properties. One common approach is the use of
exponential downweighting, whereby the contribution of past datapoints to the algorithm is
smoothly downweighted via the use of forgetting factors. In Section 5 we demonstrate how
historical data retirement via suitably constructed informative priors can reproduce this
effect in the non-parametric dynamic tree modelling context, while remaining fully online.
Using synthetic as well as real datasets, we show how this approach compares favourably
against modern alternatives. The paper concludes with a discussion in Section 6.
2 Dynamic Trees



Dynamic trees (DTs) (Taddy et al., 2011) are a process-analog of Bayesian treed models
(Chipman et al., 1998, 2002). The model specification is amenable to fast sequential inference
by SMC, yielding a predictive surface which organically increases in complexity as more data
arrive. Software is available in the dynaTree package (Gramacy and Taddy, 2011) for R on
CRAN (R Development Core Team, 2010), which has been extended to cover the techniques
described in this paper. We now review model specification and inference in turn.



2.1 Bayesian static treed models 

Trees partition the input space $\mathcal{X}$ into hyper-rectangles, referred to as leaves, using nested
logical rules of the form $(x_i \geq c)$. For instance, the partition $(x_1 \geq 3) \cap (x_2 < -1)$,
$(x_1 \geq 3) \cap (x_2 \geq -1)$ and $(x_1 < 3)$ represents a tree whose root splits on $(x_1 \geq 3)$, with
one further internal node, $(x_2 \geq -1)$, and three leaves. We denote by $\eta(x)$ the unique leaf
to which $x$ belongs, for any $x \in \mathcal{X}$.

Treed models condition the likelihood function on a tree $\mathcal{T}$ and fit one instance of a
given simple parametric model per leaf. In this way, a flexible model is built out of simple
parametric models $(\theta_\eta)_{\eta \in \mathcal{L}_\mathcal{T}}$, where $\mathcal{L}_\mathcal{T}$ is the set of leaves in $\mathcal{T}$. This flexibility
comes at the price of a hard model search and selection problem: that of selecting a suitable
tree structure.



In the seminal work of Chipman et al. (1998), a Bayesian solution to this problem was
proposed that relied on a generative prior distribution over trees: a leaf node $\eta$ may split with
probability $p_{\mathrm{split}}(\mathcal{T}, \eta) = \alpha(1 + D_\eta)^{-\beta}$, where $\alpha, \beta > 0$, and $D_\eta$ is the depth of $\eta$ in the
tree $\mathcal{T}$. This induces a joint prior via the probability that internal nodes $\mathcal{I}_\mathcal{T}$ have split and
leaves $\mathcal{L}_\mathcal{T}$ have not:

$$\pi(\mathcal{T}) \propto \prod_{\eta \in \mathcal{I}_\mathcal{T}} p_{\mathrm{split}}(\mathcal{T}, \eta) \prod_{\eta \in \mathcal{L}_\mathcal{T}} \left[1 - p_{\mathrm{split}}(\mathcal{T}, \eta)\right].$$

The specification is completed by employing independent sampling models at the tree leaves:
$p(y_1, \dots, y_n \mid \mathcal{T}, x_1, \dots, x_n) = \prod_{\eta \in \mathcal{L}_\mathcal{T}} p(y^\eta \mid \mathcal{T}, x^\eta)$. Sampling from the posterior
proceeds by MCMC, via proposed local changes to $\mathcal{T}$: so-called grow, prune, change, and swap
"moves". Any data type/model may be used as long as the marginal likelihoods
$p(y^\eta \mid \mathcal{T}, x^\eta)$ are analytic, i.e., as long as their parameters can be integrated out. This is
usually facilitated by fully conjugate, scale invariant, default (non-informative) priors, e.g.:



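$$y \mid x \sim N(\mu_\eta + x^\top \beta_\eta,\; \sigma^2_\eta), \qquad \pi(\mu_\eta, \beta_\eta, \sigma^2_\eta) \propto \frac{1}{\sigma^2_\eta} \qquad (1)$$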



for linear, or, letting $\beta_\eta = 0$, constant regression leaves. Similarly, multinomial leaves for
classification with Dirichlet priors can be employed. These choices yield analytical posteriors
(Taddy et al., 2011) but also efficient recursive updates for incorporating new datapoints (see
Section 3.1).



2.2 Dynamic Trees 

In DTs the "moves" are embedded into a process, which describes how old trees mature into
new ones when new data arrive. Suppose that $\mathcal{T}_{t-1}$ represents a set of recursive partitioning
rules associated with $x^{t-1}$, the set of covariates observed up to time $t - 1$. The fundamental
insight underlying the DT process is to view this tree as a latent state, evolving according to
a state transition probability, $P(\mathcal{T}_t \mid \mathcal{T}_{t-1}, x_t)$. The dependence on $x_t$ (but not on $y_t$) allows
us to consider only moves local to the current observation: i.e., pruning or growing can only
occur (if at all) for the leaf $\eta(x_t)$. This builds computational tractability into the process,
as we need to recompute in that area either way. Formally, we let:




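$$P(\mathcal{T}_t \mid \mathcal{T}_{t-1}, x_t) = \begin{cases} 0, & \text{if } \mathcal{T}_t \text{ is not reachable from } \mathcal{T}_{t-1} \text{ via moves local to } x_t, \\ p_m \, \pi(\mathcal{T}_t), & \text{otherwise,} \end{cases} \qquad (2)$$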

where $p_m$ is the probability of the unique move that can produce $\mathcal{T}_t$ from $\mathcal{T}_{t-1}$, and $\pi$ is
the tree prior. We allow three types of moves: grow, prune and stay moves. Each type is
considered equiprobable, whereas for grow moves, we choose among all possible split locations
by first choosing a dimension $j$ uniformly at random, and splitting $\eta(x_t)$ around the location
$x_j = \xi$ chosen uniformly at random from the interval formed from the projection of $\eta(x_t)$
on the $j$th input dimension. The new observation, $y_t$, completes a stochastic rule for the
update $\mathcal{T}_{t-1} \to \mathcal{T}_t$ via $p(y^t \mid \mathcal{T}_t, x^t)$ for each $\mathcal{T}_t \in \{\mathcal{T}_t\}$.
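The move proposal described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: a leaf is represented by its list of per-dimension bounds, and the helper names `propose_move` and `propose_grow` are hypothetical rather than part of any existing package.

```python
import random

MOVES = ("grow", "prune", "stay")

def propose_move(rng):
    """Pick one of the three move types with equal probability."""
    return rng.choice(MOVES)

def propose_grow(bounds, rng):
    """Propose a split of the leaf containing the new point.

    bounds is a list of (lower, upper) pairs, one per input dimension,
    describing the projection of the leaf onto each dimension.
    Returns the split dimension j and the split location xi.
    """
    j = rng.randrange(len(bounds))   # dimension chosen uniformly at random
    lower, upper = bounds[j]
    xi = rng.uniform(lower, upper)   # location uniform on the leaf's extent
    return j, xi
```

For example, `propose_grow([(-1.0, 2.0), (0.0, 3.0)], random.Random(0))` returns a dimension index in {0, 1} and a location inside the corresponding interval.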



The DT specification is amenable to Sequential Monte Carlo (e.g., Carvalho et al., 2010)
inferential mechanics. At each iteration $t$, the discrete approximation to the tree posterior
$\{\mathcal{T}_{t-1}^{(i)}\}_{i=1}^N$, based on $N$ particles, can be updated to $\{\mathcal{T}_t^{(i)}\}_{i=1}^N$ by resampling and then
propagating. Resampling the particles (with replacement) proceeds according to their predictive
probability for the next $(x, y)$ pair, $w_i = p(y_t \mid \mathcal{T}_{t-1}^{(i)}, x_t)$. Then, propagating each resampled
particle follows the process outlined in Section 2.2. Both steps are computationally efficient because
they involve only local calculations (requiring only the subtrees of the parent of each $\eta^{(i)}(x_t)$).
Nevertheless, the particle approximation can shift great distances in posterior space after
an update because the data governed by $\eta(x_t)^{(i)}$ may differ greatly from one particle to
another, and thus so may the weights $w_i$. This appealing division of labour mimics the
behaviour of an ensemble method without explicitly maintaining one. As with all particle
simulation methods, some Monte Carlo (MC) error will accumulate and, in practice, one
must be careful to assess its effect. Nevertheless, DT out-of-sample performance compares
favourably to other nonparametric methods, like Gaussian process (GP) regression and
classification, but at a fraction of the computational cost (Taddy et al., 2011).
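The resample-then-propagate step can be summarised by the following minimal sketch. The callables `predictive` and `propagate` are placeholders (assumptions of this example) standing in for the leaf-level predictive probability and the local tree move; the sketch uses plain multinomial resampling, which is one common choice and not necessarily the scheme used in dynaTree.

```python
import random

def resample(particles, weights, rng=random):
    """Multinomial resampling: draw len(particles) particles with replacement,
    each chosen with probability proportional to its predictive weight."""
    if sum(weights) <= 0.0:
        raise ValueError("at least one weight must be positive")
    return rng.choices(particles, weights=weights, k=len(particles))

def smc_step(particles, x, y, predictive, propagate, rng=random):
    """One particle update for a new pair (x, y).

    predictive(tree, x, y) returns the weight of a particle;
    propagate(tree, x, y, rng) returns the (possibly modified) tree.
    Duplicated particles share state here, so a real implementation would
    copy a tree before modifying it in propagate.
    """
    weights = [predictive(tree, x, y) for tree in particles]
    survivors = resample(particles, weights, rng)
    return [propagate(tree, x, y, rng) for tree in survivors]
```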



3 Datapoint retirement 



At time $t$, the DT algorithm of Taddy et al. (2011) may need to access arbitrary parts
of the data history in order to update the particles. Hence, although sequential inference
is fast, the method is not technically online: tree complexity grows as $\log t$, and at every
update each of the $x^t = (x_1, \dots, x_t)$ locations are candidates for new splitting locations
via grow. To enable online operation with constant memory requirements, this covariate
pool ($x^t$) must be reduced to a size $w$, constant in $t$. This can only be achieved via data
discarding. Crucially, the analytic/parametric nature of DT leaves enables a large part of
any discarded information to be retained in the form of informative leaf priors. In effect, this
yields a soft implementation of data discarding, which we refer to as datapoint retirement.
We show that retirement can preserve the posterior predictive properties of the tree even
after data are discarded, and furthermore following subsequent prune and stay operations.
The only situation where the loss of data hurts is when new data arrive and demand a more
complex tree. In that case, any retired points would not be available as anchors for new
partitions. Again, since tree operations are local in nature, only the small subtree nearby
$\eta(x_t)$ is affected by this loss of DoFs, whereas the complement $\mathcal{T}_t \setminus \eta(x_t)$, i.e., most of the
tree, is not affected.



3.1 Conjugate informative priors at the leaf level 

Consider first a single leaf $\eta \in \mathcal{T}_t$ in which we have already retired some data. That is,
suppose we have discarded $(x_s, y_s)_{\{s\}}$, which was in $\eta$ in $\mathcal{T}_{t'}$ at some time $t' < t$. The
information in this data can be 'remembered' by taking the leaf-specific prior, $\pi(\theta_\eta)$, to be
the posterior of $\theta_\eta$ given (only) the retired data. Suppressing the $\eta$ subscript, we may take
$\pi(\theta) \stackrel{\mathrm{df}}{=} P(\theta \mid (x_s, y_s)_{\{s\}}) \propto L(\theta; (x_s, y_s)_{\{s\}})\, \pi_0(\theta)$, where $\pi_0(\theta)$ is a baseline non-informative
prior employed at all leaves. The active data in $\eta$, i.e., the points which have not been retired,
enter into the likelihood in the usual way to form the leaf posterior.

Defining retirement in this way is straightforward; it is more important to argue that such retired
information can be updated losslessly, and in a computationally efficient way. Suppose we wish
to retire one more datapoint, $(x_r, y_r)$. Consider the following recursive updating equation:

$$\pi^{(\mathrm{new})}(\theta) \stackrel{\mathrm{df}}{=} P(\theta \mid (x_s, y_s)_{\{s\}, r}) \propto L(\theta; x_r, y_r)\, P(\theta \mid (x_s, y_s)_{\{s\}}). \qquad (3)$$

As shown below, the calculation in (3) is tractable whenever conjugate priors are employed.

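Consider first the linear regression model, $y \sim N(X\beta, \sigma^2 I)$, where $y = (y_s)_{\{s\}}$ is the
retired response data, and $X$ the retired augmented design matrix, i.e., whose rows are like
$[1, x_s^\top]$, so that $\beta_1$ represents an intercept. With $\pi_0(\beta, \sigma^2) \propto \frac{1}{\sigma^2}$, we obtain:

$$\pi(\beta, \sigma^2) \stackrel{\mathrm{df}}{=} P(\beta, \sigma^2 \mid y, X) = \mathrm{NIG}(\nu/2,\; s\nu/2,\; \hat\beta,\; \mathcal{G}^{-1})$$

where NIG stands for Normal-Inverse-Gamma, and assuming the Gram matrix $\mathcal{G} = X^\top X$
is invertible and denoting $Xy = X^\top y$, $r = y^\top y$, we have $\nu = n - p$, $\hat\beta = \mathcal{G}^{-1} Xy$, and
$s^2 = \frac{1}{\nu}(r - \mathcal{R})$, where $\mathcal{R} = \hat\beta^\top \mathcal{G} \hat\beta$. Having discarded $(y_s, x_s)_{\{s\}}$, we can still afford to keep
in memory the values of the above statistics, as, crucially, their dimension does not grow
with $|\{s\}|$. Updating the prior to incorporate an additional retiree $(y_r, x_r)$, with augmented
row $X_r = [1, x_r^\top]$, is easy:

$$\mathcal{G}^{(\mathrm{new})} = \mathcal{G} + X_r^\top X_r, \quad Xy^{(\mathrm{new})} = Xy + X_r^\top y_r, \quad s^{(\mathrm{new})} = s + y_r^\top y_r, \quad \nu^{(\mathrm{new})} = \nu + 1. \qquad (4)$$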

The constant leaf model may be obtained as a special case of the above, where $x_s = 1$,
$\mathcal{G} = \nu$ and $\hat\beta = \hat\mu$. For the multinomial model, the discarded response values $y_s$ may be
represented as indicator vectors $z_s$, where $z_{js} = \mathbb{1}(y_s = j)$. The natural conjugate here is the
Dirichlet $\mathcal{D}(a)$. The hyperparameter vector $a$ may be interpreted as counts, and is updated
in the obvious manner, namely $a^{(\mathrm{new})} = a + z_r$ where $z_{jr} = \mathbb{1}(y_r = j)$. A sensible baseline
is $a_0 = (1, 1, \dots, 1)$. See O'Hagan and Forster (2004) for more details.
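To make the additive nature of these updates concrete, the following sketch absorbs retired pairs into a fixed-size set of summaries for a linear leaf. It is an illustration only: the dictionary layout and function names are assumptions, and for clarity it tracks the raw sum of squares $r = y^\top y$ and the retired count $n$ directly, from which $\nu$ and $s^2$ can be derived.

```python
def empty_stats(p):
    """Retired-data summaries for a leaf with p coefficients (intercept included)."""
    return {"G": [[0.0] * p for _ in range(p)],  # Gram matrix X'X
            "Xy": [0.0] * p,                     # cross-product X'y
            "r": 0.0,                            # sum of squares y'y
            "n": 0}                              # number of retired points

def retire(stats, x, y):
    """Absorb one retired pair (x, y); x excludes the leading 1 of the design row.
    Memory use does not depend on how many points have been retired."""
    row = [1.0] + list(x)
    for i, xi in enumerate(row):
        stats["Xy"][i] += xi * y
        for j, xj in enumerate(row):
            stats["G"][i][j] += xi * xj
    stats["r"] += y * y
    stats["n"] += 1
    return stats
```

Because every update is a sum, the order in which points are retired does not matter, and the summaries equal those computed from the retired data in one batch.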

Unfolding the updating equations (3) and (4) makes it apparent that retirement preserves
the posterior distribution. Specifically, the posterior probability of parameters $\theta$, given the
active (non-retired) data still in $\eta$, is

$$\pi(\theta \mid x^\eta, y^\eta) \propto L(\theta; x^\eta, y^\eta)\, \pi(\theta) \propto L(\theta; x^\eta, y^\eta)\, L(\theta; (x_s, y_s)_{\{s\}})\, \pi_0(\theta) = L(\theta; x^{\eta'}, y^{\eta'})\, \pi_0(\theta),$$



where $\eta'$ is $\eta$ without having retired $(x_s, y_s)_{\{s\}}$. Since the posteriors are unchanged, so are the
posterior predictive distributions and the marginal likelihoods required for the SMC updates.
Note that new data $(x_{t+1}, y_{t+1})$ which do not update a particular node $\eta \in \mathcal{T}_t \to \mathcal{T}_{t+1}$ do not
change the properties of the posterior local to the region of the input space demarcated by $\eta$.
It is as if the retired data were never discarded. Only where updates demand modifications
of the tree local to $\eta$ is the loss in DoF felt. We argue in Section 3.2 that this impact can be
limited to operations which grow the tree locally. Cleverly choosing which points to retire
can further mitigate the impact of discarding (see Section 4).
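The preservation property is easy to check numerically in the multinomial case. The sketch below, with hypothetical helper names, moves labels from the likelihood into the Dirichlet prior and confirms that the posterior predictive class probabilities are unchanged.

```python
def dirichlet_predictive(prior_counts, active_labels):
    """Posterior predictive class probabilities under a Dirichlet prior with
    the given counts and a multinomial likelihood for the active labels."""
    counts = list(prior_counts)
    for y in active_labels:
        counts[y] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def retire_label(prior_counts, y):
    """Move one label from the likelihood into the prior: a_new = a + z_r."""
    updated = list(prior_counts)
    updated[y] += 1.0
    return updated

labels = [0, 1, 1, 2, 1]          # class labels of five points in one leaf
baseline = [1.0, 1.0, 1.0]        # non-informative baseline a_0
full = dirichlet_predictive(baseline, labels)

# Retire the first two points, keeping only the last three as active data.
prior = retire_label(retire_label(baseline, labels[0]), labels[1])
after_retirement = dirichlet_predictive(prior, labels[2:])
```

Here `full` and `after_retirement` coincide, since both are computed from the same total counts.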

3.2 Managing informative priors at the tree level 

Intuitively, DTs with retirement manage two types of information: a non-parametric memory
comprising an active data pool of constant size $w \ll t$, which forms the leaf likelihoods; and a
parametric memory consisting of possibly informative leaf priors. The algorithm we propose
proceeds as follows. At time $t$, add the $t$th datapoint to the active pool, and update the model
by SMC exactly as explained in Section 2. Then, if $t$ exceeds $w$, also select some datapoint,
$(x_r, y_r)$, and discard it from the active pool (see Section 4 for selection criteria), having first
updated the associated leaf prior for $\eta(x_r)^{(i)}$, for each particle $i = 1, \dots, N$, to 'remember'
the information present in $(x_r, y_r)$. This shifts information from the likelihood part of the
posterior to the prior, exactly preserving the time-$t$ posterior predictive distribution and
marginal likelihood for every leaf in every tree.¹
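The bookkeeping of this algorithm, stripped of the tree model itself, might look as follows. This is a schematic sketch under stated assumptions: `absorb` is a hypothetical callback that updates the relevant leaf priors, `choose` is a hypothetical selection rule returning an index into the active pool (random if omitted), and the SMC update is only marked by a comment.

```python
import random

def stream_with_retirement(stream, w, absorb, choose=None, rng=random):
    """Process a stream while keeping at most w active points.

    Each arriving point joins the active pool. Once the pool exceeds w,
    one point is selected, passed to absorb() so that the parametric
    memory can 'remember' it, and only then removed from the pool.
    """
    active = []
    for point in stream:
        active.append(point)
        # (the SMC update of the particle cloud would happen here)
        if len(active) > w:
            if choose is not None:
                idx = choose(active)
            else:
                idx = rng.randrange(len(active))
            absorb(active[idx])   # update the prior first ...
            del active[idx]       # ... then discard from the active pool
    return active
```

Passing `choose=lambda pool: 0` retires the oldest point each time, which corresponds to historical discarding; an active discarding rule would instead rank the pool by a statistic such as ALC or entropy.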

The situation changes when the next data point $(x_{t+1}, y_{t+1})$ arrives. Recall that the DT
update chooses between stay, prune, or grow nearby each $\eta(x_{t+1})^{(i)}$. Grow and prune moves
are affected by the absence of the retired data from the active data pool. In particular,
the tree cannot grow if there are no active data candidates to split upon. This informs our
assessment of retiree selection criteria in Section 4, as it makes sense not to discard points
in parts of the input space where we expect the tree to require further DoFs. Moreover,
we recognise that the stochastic choice between the three DT moves depends both upon
the likelihood, and retired (prior) information local to $\eta(x_{t+1})^{(i)}$, so that the way that prior
information propagates after a prune, or grow move, matters. The original DT model dictates
how likelihood information (i.e., resulting from active data) propagates for each move. We
must provide a commensurate propagation for the retired information to ensure that the
resulting online trees stay close to their full data counterparts.

If a stay move is chosen stochastically, no further action is required: retiring data has no
effect on the posterior. When nodes are grown or pruned, the retiring mechanism itself, which
dictates how informative priors can salvage discarded likelihood information, suggests a
method for splitting and combining that information. Following a prune, retired information
from the pruned leaves, $\eta$ and its sibling $S(\eta)$, must be pooled into the new leaf prior
positioned at the parent $P(\eta)$. Conjugate updating suggests the following additive rule:

$$\mathcal{G}^{P(\eta)} = \mathcal{G}^{\eta} + \mathcal{G}^{S(\eta)}, \quad Xy^{P(\eta)} = Xy^{\eta} + Xy^{S(\eta)}, \quad s^{P(\eta)} = s^{\eta} + s^{S(\eta)}, \quad \nu^{P(\eta)} = \nu^{\eta} + \nu^{S(\eta)}.$$

Note that this does not require access to the actual retired datapoints, and results in
the same posterior that would have been obtained had the data not been discarded.

1 In fact, every data point under active management can [in a certain limited sense] be retired without
information loss.

A sensible grow move can be derived by reversing this logic. We suggest letting both
novel child leaves $\ell(\eta)$ and $r(\eta)$ inherit the parent prior, but split its strength $\nu^\eta$ between
them at proportions equal to the active data proportions in each child. Let $\alpha = |\ell(\eta)| / |\eta|$. Then,

$$\nu^{\ell(\eta)} = \alpha \nu^{\eta}, \quad \mathcal{G}^{\ell(\eta)} = \alpha \mathcal{G}^{\eta}, \quad Xy^{\ell(\eta)} = \alpha Xy^{\eta}, \quad s^{\ell(\eta)} = \alpha s^{\eta},$$

$$\nu^{r(\eta)} = (1 - \alpha) \nu^{\eta}, \quad \mathcal{G}^{r(\eta)} = (1 - \alpha) \mathcal{G}^{\eta}, \quad Xy^{r(\eta)} = (1 - \alpha) Xy^{\eta}, \quad s^{r(\eta)} = (1 - \alpha) s^{\eta}.$$

In other words, the new child priors share the retired information of the parents with weight
proportional to the number of active data points they manage relative to the parent. This
preserves the total strength of retired information, preserves the balance between active data
and parametric memory, and is reversible: subsequent prune operations will exactly undo
the partitioned prior, combining it into the same prior sufficient statistics at the parent.
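The pooling and splitting rules, and the fact that a prune undoes a grow, can be illustrated as follows. For brevity the sketch assumes scalar statistics, as in the constant leaf model; the dictionary keys and function names are illustrative choices, and matrix-valued statistics would be scaled and added elementwise in the same way.

```python
def pool_priors(left, right):
    """Prune: add the retired statistics of two sibling leaves."""
    return {key: left[key] + right[key] for key in left}

def split_prior(parent, n_left, n_right):
    """Grow: share the parent's retired statistics between two children in
    proportion to the number of active points each child receives."""
    alpha = n_left / (n_left + n_right)
    left = {key: alpha * value for key, value in parent.items()}
    right = {key: (1.0 - alpha) * value for key, value in parent.items()}
    return left, right

# Retired summaries at a parent leaf (scalar case, made-up numbers).
parent = {"nu": 8.0, "G": 8.0, "Xy": 12.0, "s": 40.0}

# A grow move sends 1 active point left and 3 right, so alpha = 0.25.
left, right = split_prior(parent, n_left=1, n_right=3)

# A subsequent prune recovers the parent's statistics.
restored = pool_priors(left, right)
```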

This brings to light a second cost to discarding data, the first being a loss of candidates
for future partitioning. Nodes grown using priors built from retired points lack specific
location information from the actual retired $(x_s, y_s)$ pairs. Therefore newly grown leaves must
necessarily compromise between explaining the new data, e.g., $(x_{t+1}, y_{t+1})$, with immediately
local active data in $\eta(x_{t+1})$, and information from retired points with less localised
influence. The weight of each component in the compromise is $|\eta| / (|\eta| + \nu^\eta)$ and
$\nu^\eta / (|\eta| + \nu^\eta)$, respectively. Eventually as $t$ grows, with $w \ll t$ staying constant, retired information
naturally dominates, precluding new grows even when active partitioning candidates exist.
This means that while the hierarchical way in which retired data filters through to inference
(and prediction) at the leaves is sensible, it is doubly important that data points in parts of
the input space where the response is very complex should not be discarded.



4 Active discarding 

It matters which data points are chosen for retirement, so it is desirable to retire datapoints
that will be of "less" use to the model going forward. In the case of drifting concepts,
retiring historically, i.e., retiring the oldest datapoints, may be sensible. We address this
in Section 5. Here we consider static concepts, or in other words i.i.d. data. We formulate
the choice of which active data points to retire as an active discarding (AD) problem by
borrowing (and reversing) techniques from the active learning (AL) literature. We treat regression
and classification models separately, as they require different AD techniques. We shall argue
that in both cases AD is, in fact, easier than AL since DTs enable thrifty analytic calculations
not previously possible, which are easily updated within the SMC.



4.1 Active discarding for regression 



Active learning (AL) procedures are sequential decision heuristics for choosing data to add
to the design, usually with the aim of minimising prediction error. Two common AL
heuristics are active learning MacKay (MacKay, 1992, ALM) and active learning Cohn (Cohn,
1996, ALC). They were popularised in the modern nonparametric regression literature (Seo
et al., 2000) using GPs, and subsequently ported to DTs (Taddy et al., 2011). An ALM
scheme selects new inputs $x^*$ with maximum variance for $y(x^*)$, whereas ALC chooses $x^*$ to
maximise the expected reduction in predictive variance averaged over the input space. Both
approximate maximum expected information designs in certain cases. ALC is
computationally more demanding than ALM, requiring an integral over a set of reference locations that
can be expensive to approximate numerically for most models. But it leads to better
exploration when used with nonstationary models like DTs because it concentrates sampled points
near to where the response surface is changing most rapidly (Taddy et al., 2011). ALM has
the disadvantage that it does not cope well with heteroskedastic data (i.e., input-dependent
noise). It can end up favouring regions of high noise rather than high model uncertainty.
Both are sensitive to the choice of (and density of) a search grid over which the variance
statistics are evaluated.

Our first simplification when porting AL to AD is to recognise that no grids are needed. 
We focus on the ALC statistic here because it is generally preferred, but also to illustrate 
how the integrals required are actually very tractable with DTs, which is not true in general. 
The AD program is to evaluate the ALC statistic at each active data location, and choose 
the smallest one for discarding. AL, by contrast, prefers large ALC statistics to augment the 
design. We focus on the linear leaf model, as the constant model may be derived as a special 
case. For an active data location x and (any) reference location z, the reduction in variance 
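at $z$ given that $x$ is in the design, and a tree $\mathcal{T}$, is given by (see Taddy et al. (2011)):

$$\Delta\sigma^2_x(z \mid \mathcal{T}) = \sigma^2(z \mid \mathcal{T}) - \sigma^2_x(z \mid \mathcal{T}) = \frac{s^2_\eta - \mathcal{R}_\eta}{|\eta| - m - 3} \times \frac{\left(\frac{1}{|\eta|} + z^\top \mathcal{G}_\eta^{-1} x\right)^2}{1 + \frac{1}{|\eta|} + x^\top \mathcal{G}_\eta^{-1} x}$$

when both $x$ and $z$ are in $\eta \in \mathcal{L}_\mathcal{T}$, and zero otherwise. This expression is valid whether
learning or discarding; however, AL requires evaluating $\Delta\sigma^2(x)$ over a dense candidate grid
of $x$'s. AD need only consider the current active data locations, which can represent a
dramatic saving in computational cost.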
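Integrating over $z$ gives:

$$\Delta\sigma^2(x) = \int \Delta\sigma^2_x(z)\, dz = \frac{s^2_\eta - \mathcal{R}_\eta}{(|\eta| - m - 3)\left(1 + \frac{1}{|\eta|} + x^\top \mathcal{G}_\eta^{-1} x\right)} \int_\eta \left(\frac{1}{|\eta|} + z^\top \mathcal{G}_\eta^{-1} x\right)^2 dz.$$

The integral that remains, over the rectangular region $\eta$, is tedious to write out but has a
trivial $O(m^2)$ implementation. Let the $m$-rectangle $\eta$ be described by $\{(a_i, b_i)\}_{i=1}^m$. Then,

$$\int_{a_1}^{b_1} \!\!\cdots\! \int_{a_m}^{b_m} \left(c + z^\top \mathcal{G}_\eta^{-1} x\right)^2 dz_1 \cdots dz_m = A_\eta c^2 + c \sum_i \Big(\prod_{k \neq i} (b_k - a_k)\Big) \tilde{x}_i (b_i^2 - a_i^2)$$
$$\qquad + \sum_i \Big(\prod_{k \neq i} (b_k - a_k)\Big) \frac{\tilde{x}_i^2}{3} (b_i^3 - a_i^3) + \sum_i \sum_{j < i} \Big(\prod_{k \neq i, j} (b_k - a_k)\Big) \frac{\tilde{x}_i \tilde{x}_j}{2} (b_i^2 - a_i^2)(b_j^2 - a_j^2),$$

where $\tilde{x} = x^\top \mathcal{G}_\eta^{-1}$, $c = 1/|\eta|$, and $A_\eta = \prod_k (b_k - a_k)$ is the volume of $\eta$. A general-purpose
numerical version via sums using $R$ reference locations $z$, previously the state of the art
(Taddy et al., 2011), requires $O(Rm)$ computation with $R$ growing exponentially in $m$ for
reasonable accuracy. Observe that the rectangular leaf regions generated by the trees are key.
In the case of other partition models (like Voronoi tessellation models), this analytical
integration would not be possible.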
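The closed-form integral of the squared affine function $(c + \tilde{x}^\top z)^2$ over a rectangle can be checked against brute-force quadrature. The sketch below is an illustration with hypothetical function names; it takes the vector `xt` (playing the role of $\tilde{x}$) as given rather than computing it from a leaf, and assumes every side of the rectangle has strictly positive length.

```python
from itertools import product

def rect_integral(c, xt, bounds):
    """Closed-form integral of (c + sum_i xt[i] * z_i)^2 over the rectangle
    bounds = [(a_1, b_1), ..., (a_m, b_m)], using O(m^2) operations."""
    widths = [b - a for a, b in bounds]          # assumed strictly positive
    sq = [b * b - a * a for a, b in bounds]      # b_i^2 - a_i^2
    cube = [b ** 3 - a ** 3 for a, b in bounds]  # b_i^3 - a_i^3
    volume = 1.0
    for width in widths:
        volume *= width
    total = c * c * volume
    for i in range(len(bounds)):
        vol_i = volume / widths[i]               # product of widths, k != i
        total += c * xt[i] * sq[i] * vol_i
        total += xt[i] ** 2 * cube[i] / 3.0 * vol_i
        for j in range(i):
            vol_ij = vol_i / widths[j]           # product of widths, k != i, j
            total += xt[i] * xt[j] * sq[i] * sq[j] / 2.0 * vol_ij
    return total

def midpoint_integral(c, xt, bounds, n=100):
    """Midpoint-rule approximation on an n^m grid; only practical for small m."""
    cell = 1.0
    axes = []
    for a, b in bounds:
        h = (b - a) / n
        cell *= h
        axes.append([a + (k + 0.5) * h for k in range(n)])
    total = 0.0
    for z in product(*axes):
        total += (c + sum(t * zi for t, zi in zip(xt, z))) ** 2
    return total * cell
```

The grid version needs $n^m$ function evaluations, which illustrates why the analytic form matters as the input dimension grows.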

In repeated applications of ALC for AD, we observe that the active points that remain
tend to shuffle themselves so that they cluster near the high posterior partitioning boundaries,
which makes sense because these are the locations where the predictive surface is changing the
fastest. The number of such locations depends on the number of active data points allowed,
$w$. As an illustration, consider the simple example where the response is a parabolic function,
which must be learned sequentially via $x$-data sampled uniformly in $(-3, 2)$, with $w = 25$.
The initial 25, before any retiring, are shown in the first panel. Each updating round then
proceeds with one retirement followed by one new pair, and subsequent SMC update. Since
the implementation requires at least five points in each leaf, seeing four regimes emerge is
perhaps not surprising. By $t = 150$, the third panel, the ability to learn about the mean with
just 25 degrees of freedom is saturated, but it is possible to improve on the variance (shown
as errorbars), which is indeed smaller in the final $t = 300$ panel. Eventually, the points will
cluster at the ends because that is where the response is changing most rapidly, and indeed
the derivative is highest there (in absolute value).

[Figure 1 panels: t = 25, t = 75, t = 150, t = 300]

Figure 1: Snapshots of active data (25 points) and predictive surfaces spanning 275
retirement/updating rounds; $Y(x) = x + x^2 + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, 1)$.



4.2 Active discarding for classification 

For classification, predictive entropy is an obvious AL heuristic. Given a predictive surface
comprised of probabilities $p_\ell(x)$ for each class $\ell$, from DTs or otherwise, the predictive
entropy at $x$ is $-\sum_\ell p_\ell(x) \log p_\ell(x)$. Entropy can be an optimal method for measuring
predictive uncertainty, but that does not mean it is good for AL. Many authors (e.g., Joshi
et al., 2009) have observed that it can be too greedy: entropy can be very high near the best
explored class boundaries. Several, largely unsatisfactory, remedies have been suggested in
the literature. Fortunately, no remedy is required for the AD analog, which focuses on the
lowest entropy active data, a finite set. The discarded points will tend to be far into the
class interior, where they can be safely subsumed into the prior. The resulting spacing and
shifting of the active pool is quite similar to discarding by ALC for regression, and so we do not
illustrate it here.
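Entropy-based discarding reduces to ranking the active points by predictive entropy and retiring the lowest. A minimal sketch, assuming the class probabilities at each active location have already been computed (the function names are illustrative):

```python
import math

def predictive_entropy(probs):
    """Entropy -sum_l p_l log p_l of a class probability vector,
    with the convention that 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def choose_retiree(class_probs):
    """Index of the active point whose predictive entropy is lowest.

    class_probs holds one probability vector per active data location.
    """
    entropies = [predictive_entropy(p) for p in class_probs]
    return min(range(len(entropies)), key=entropies.__getitem__)

# Three active points in a two-class problem: the second is deep in a
# class interior and is therefore the one selected for retirement.
pool = [[0.5, 0.5], [0.99, 0.01], [0.7, 0.3]]
retiree = choose_retiree(pool)
```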



4.3 Fast local updates of active discarding statistics with trees 

The divide and conquer nature of trees, whose posterior distribution is approximated by
thrifty, local, particle updates, allows AD statistics to be updated cheaply too. If each leaf
node stores its own AD statistics, it suffices to update only the ones in leaf nodes which have
been modified, as described below. Any recalculated statistics can then be subsumed into
a global, particle averaged, version. Note that no updates to the AD statistics are needed
when a point is retired since the predictive distributions are unchanged.

When a new datapoint $(x, y)$ arrives, the posterior undergoes two types of changes:
resample then propagate. In the resample step the discrete particle distribution changes,
although the trees therein do not change. Therefore, each discarded particle must have
its AD statistics (stored at the leaves) subtracted from the full particle tally. Then each
correspondingly duplicated particle can have its AD statistics added in. No new integrals
(for ALC) or entropy calculations (for classification) are needed. In the propagate step, each
particle undergoes a change local to $\eta(x)^{(i)} \in \mathcal{T}_t^{(i)}$. This requires first calculating the AD
statistic for the new $(x, y)$ for each $\eta^{(i)}(x)$, before the dynamic update occurs, and then
swapping it into the particle average. New integrations, etc., need evaluating here. Then,
each non-stay dynamic update triggers a swap of the old AD statistics in $\eta^{(i)}(x)$ for freshly
re-calculated ones from the leaf node(s) in the updated tree. The total computational cost is in
$O(m^2 N)$ for incorporating $(x, y)$ into $N$ particles, plus $O(m^2 \sum_{i=1}^N |\eta^{(i)}(x)|)$ to update the leaves.²
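The tally adjustment in the resample step can be sketched as follows. The data layout is an assumption made for illustration: each particle stores a dictionary mapping an active point's identifier to its AD statistic, and the tally holds the sum of these over the current particle set.

```python
from collections import Counter

def resample_tally(tally, particle_stats, old_ids, new_ids):
    """Adjust the running tally of AD statistics after resampling.

    tally[point] is the sum over particles of that point's AD statistic.
    particle_stats[pid] maps point -> statistic for particle pid.
    old_ids and new_ids list the particle indices before and after
    resampling. No statistic is recomputed; stored values are only
    subtracted (dropped particles) or added (duplicated particles).
    """
    change = Counter(new_ids)
    change.subtract(Counter(old_ids))
    for pid, delta in change.items():
        if delta == 0:
            continue
        for point, value in particle_stats[pid].items():
            tally[point] = tally.get(point, 0.0) + delta * value
    return tally

stats = [{"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 4.0}, {"a": 5.0, "b": 6.0}]
tally = {"a": 9.0, "b": 12.0}           # sums over particles 0, 1, 2
# Resampling drops particle 1 and duplicates particle 0.
tally = resample_tally(tally, stats, old_ids=[0, 1, 2], new_ids=[0, 0, 2])
```

Dividing the tally by the number of particles gives the particle-averaged statistic used to rank the active points.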

4.4 Empirical results 

Here we explore the benefit of AD over simpler heuristics, like random discarding and
subsetted data estimators, by making predictive comparisons on benchmark regression and
classification data. To focus the discussion on our key objective for this section, we employ
moderate data sample sizes in order to allow a comparison to full-data versions of DTs, and
assess the impact of data discarding on performance. In particular, we do not repeat here
a comparison of full-data DTs to competitors, which may be found in Taddy et al. (2011),
but emphasise that discarding enables DTs to operate on (arbitrarily long) data streams,
where the original DTs, as well as their main GP-based competitors, will eventually become
intractable. This is better illustrated by the use of massive and streaming classification
datasets in Section 5.

Simple synthetic regression data 

We first consider data originally used to illustrate multivariate adaptive regression splines
(MARS) (Friedman, 1991), and then to demonstrate the competitiveness of DTs relative to
modern (batch) nonparametric models (Taddy et al., 2011). The response is
$10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5$ plus $\mathcal{N}(0, 1)$ additive
error. Inputs $x$ are random in $[0, 1]^5$. We considered four estimators: one based on 200
pairs (ORIG), one based on 1800 more for 2000 total (FULL), and two online versions using
either random (ORAND) or ALC (OALC) retiring to keep the total active data set limited to
$w = 200$. ORIG is intended as a lower benchmark, representing a naive fixed-budget method;
FULL is at the upper end. The full experiment comprised 100 repeats in a MC fashion, each
with new random training sets, and random testing sets of size 1000. $N = 1000$ particles
and a linear leaf model were used throughout. Similar results were obtained for the
constant model.
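The data-generating process of this experiment can be sketched as below. This is a minimal
illustration rather than the code used for the reported results: the function names, the
seed, and the reading of "random in $[0, 1]^5$" as independent uniform draws are our
assumptions, and ORIG is taken here to be the first 200 pairs of the stream.

```python
import numpy as np

def friedman_data(n, rng, noise_sd=1.0):
    """Draw n input-response pairs from the Friedman (1991) test function."""
    # Assumption: inputs are independent uniform draws on [0, 1]^5.
    x = rng.uniform(0.0, 1.0, size=(n, 5))
    mean = (10.0 * np.sin(np.pi * x[:, 0] * x[:, 1])
            + 20.0 * (x[:, 2] - 0.5) ** 2
            + 10.0 * x[:, 3]
            + 5.0 * x[:, 4])
    y = mean + rng.normal(0.0, noise_sd, size=n)  # N(0, 1) additive error by default
    return x, y

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(0)                   # hypothetical seed
x_full, y_full = friedman_data(2000, rng)        # stream seen by FULL, ORAND and OALC
x_orig, y_orig = x_full[:200], y_full[:200]      # the 200 pairs available to ORIG
x_test, y_test = friedman_data(1000, rng)        # held-out testing set
```

One MC repeat would fit each estimator on its share of the stream and score it on
`(x_test, y_test)`; the experiment above repeats this 100 times with fresh draws.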

2 One might imagine a thriftier, but harder to implement version, which waits until the end to calculate
the AD statistic for the new point (x, y). But it would have the same computational order.

Figure 2: Friedman data comparisons by average posterior predictive density (higher is
better) and RMSE (lower is better).

Figure 2 reveals that random retiring is better than subsetting, but retiring by ALC is
even better, and can be nearly as good as the full-data estimator. In fact, OALC was the best
predictor 16% and 28% of the time by average predictive density and RMSE, respectively.
The average time used by each estimator was approximately 1, 33, 45, and 67 seconds
(ORIG, ORAND, OALC, and FULL, respectively). So random retiring on this modestly-sized
problem is 2-times faster than using the full data. ALC costs about 36% extra, time-wise,
relative to random retiring (45 versus 33 seconds), but leads to about a 35% reduction
in RMSE relative to the full estimator. We note that in much larger problems the gap
between the online and full estimators can widen considerably. The time-demands of the full
estimator grow roughly as $t \log t$, whereas the online versions stay constant.



Spam classification data 



Now consider the Spambase data set, from the UCI Machine Learning Repository (Asuncion
and Newman, 2007). The data contains binary classifications of 4601 emails based on 57
attributes (predictors). We report on a similar experiment to the Friedman/regression
example, above, except with classification leaves and 5-fold CV to create training and testing
sets. This was repeated twenty times, randomly, giving 100 sets total. Again, four estimators
were used: one based on 1/10 of the training fold (ORIG), one based on the full fold (FULL),
and two online versions trained on the same stream(s) using either random (ORAND) or
entropy (OENT) retiring to keep the total active data set limited to 1/10 of the full set.

Figure 3: Spam data comparisons by average posterior predictive probability (higher is
better) and misclassification rate (lower is better) on the testing set(s).

Figure 3 tells a similar story to the Friedman experiment: random discarding is better
than subsetting, but discarding by entropy is even better, and can be nearly as good as
the full-data estimator. Entropy retiring resulted in the best predictor 7% of the time by
misclassification rate, but never by posterior predictive probability.



5 Temporal adaptivity using forgetting factors 

The accumulation of historical information at the leaf priors introduced by data retirement
may eventually overpower the likelihood of active datapoints. This is natural in an i.i.d.
setting, but may cause performance deterioration in streaming contexts where the data
generating mechanism may evolve or change suddenly. To promote responsiveness, we may
exponentially downweight the retired data history $s$ when retiring an additional point $y_m$:
$\pi^{(\mathrm{new})}(\theta) \propto L(\theta; y_m, x_m) \, L^{\lambda}(\theta; (y_s, x_s)_{\{s\}}) \, \pi_0(\theta)$.
Observe that for $\lambda = 1$, conjugate Bayesian updating is recovered, and for
$\lambda = 0$, the retired history is disregarded altogether, effectively
resetting the prior. For $\lambda \in (0, 1)$, two effects are introduced. First, the overall
'strength' of the prior relative to the likelihood is diminished. Second, as the prior is
sequentially updated, it will place disproportionately more weight on recently retired
datapoints as opposed to older retired data. For the leaf models entertained in this paper,
a recursive application of this principle, with $\lambda \in (0, 1)$, modifies only slightly
the conjugate updates of Section 3, as follows. For the linear and constant models, we have
$(A^{(\mathrm{new})})^{-1} = \lambda A^{-1} + X_m' X_m$,
$R^{(\mathrm{new})} = \lambda R + X_m' y_m$,
$s^{(\mathrm{new})} = \lambda s + y_m' y_m$, and
$\nu^{(\mathrm{new})} = \lambda \nu + 1$, whereas for the multinomial,
we get $a^{(\mathrm{new})} = \lambda a + z_m$. For $\lambda < 1$, $\kappa$ and $\nu$ will be
bounded above by their limiting value $\frac{1}{1 - \lambda}$, irrespective of the total
number of retired datapoints.
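The discounted updates above can be sketched as follows. This is an illustrative sketch
under our own assumptions, not the implementation behind the reported experiments: the
function names and the dictionary layout of the retired statistics are hypothetical, a
single point is retired per call, and the prior statistics are started at zero purely for
demonstration.

```python
import numpy as np

def retire_point(stats, x_m, y_m, lam):
    """Discount the retired statistics by lam, then absorb one retiring point.

    stats holds 'A_inv' (p x p), 'R' (p,), 's' (float) and 'nu' (float);
    x_m is the length-p design row and y_m the scalar response being retired.
    """
    x_m = np.asarray(x_m, dtype=float)
    return {
        "A_inv": lam * stats["A_inv"] + np.outer(x_m, x_m),
        "R": lam * stats["R"] + x_m * y_m,
        "s": lam * stats["s"] + y_m ** 2,
        "nu": lam * stats["nu"] + 1.0,
    }

def retire_multinomial(a, z_m, lam):
    """Discounted update of multinomial leaf counts; z_m is a one-hot class indicator."""
    return lam * np.asarray(a, dtype=float) + np.asarray(z_m, dtype=float)

# Demonstration: with lam < 1 the effective count nu saturates near 1 / (1 - lam).
rng = np.random.default_rng(0)                     # hypothetical seed
lam = 0.9
stats = {"A_inv": np.zeros((2, 2)), "R": np.zeros(2), "s": 0.0, "nu": 0.0}
for _ in range(500):
    x_m = np.array([1.0, rng.uniform()])           # intercept plus one covariate
    stats = retire_point(stats, x_m, rng.normal(), lam)
nu_limit = 1.0 / (1.0 - lam)
```

Setting `lam=1.0` recovers the undiscounted conjugate accumulation, while `lam=0.0`
discards everything retired before the current point.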



In Ibrahim et al. (2003), this family of priors is shown to satisfy desirable
information-theoretic optimality properties. Exponential downweighting as a means of enabling
temporal adaptivity also has a long tradition in non-stationary signal processing
(Haykin, 1996), as well as streaming classification (Anagnostopoulos et al., 2009), where
$\lambda$ is often referred to as a forgetting factor.

In historical discarding, it is perhaps obvious that some degree of forgetting will be 
useful in drifting contexts, as the contribution of past data becomes decreasingly useful with 
time. The relationship between forgetting and other types of active discarding is however 
more complex. In principle, any successful active discarding scheme will lead to priors 
being populated by less relevant datapoints, so that the model can benefit from forgetting 
in favour of putting more weight on highly relevant, active data. Unfortunately, in the 
presence of drift, we cannot guarantee such reasonable behaviour from active discarding 
heuristics of the form proposed here. As these latter are reliant on an i.i.d. assumption, they 
can often mistake obsolete datapoints that are poorly explained by the model for 'highly 
informative, surprising' datapoints that had better be retained, so that it becomes less clear 
a priori whether the active data pool or the prior should be 'trusted' more, and the utility of 
forgetting becomes questionable. More sophisticated active learning heuristics are required 
to resolve this problem, which lie beyond the scope of this paper. We will thus only explore 
the interaction of forgetting with historical discarding henceforth. 

5.1 Synthetic drifting regression data 



We now revisit the Friedman dataset from Section 4.4 and introduce smooth drift by replacing
the non-linear term $10 \sin(\pi x_1 x_2)$ with a time-varying version,
$10 a_t \sin(\pi x_1 x_2)$. The coefficient $a_t$ is allowed to vary smoothly between $-1$
and $3$ over time as $a_t = 2 \sin(2 \pi k t / 1000) + 1$, so that $k$
controls the speed of the drift: $k = 1$ producing one full cycle every 1000 timesteps. Note
that as $a_t$ increases in magnitude, the non-linearity of the regression surface will
accentuate, as the first term is responsible for much of its complexity. The simulation
measures 1-step-ahead performance of the DT as follows: at each timestep $t$, it first
generates 5 datapoints from the current model; these are used as test datapoints to measure
the predictive probability and RMSE of the dynamic tree (trained using data up to time
$t - 1$); and, finally, the DT is updated on the basis of the new data.

Figure 4: Slowly (left) and rapidly (right) drifting Friedman data comparisons by average
posterior predictive density (top - higher is better) and RMSE (bottom - lower is better),
for various degrees of forgetting.
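The drift mechanism and the test-then-train protocol of this simulation can be sketched as
below. The sketch rests on assumptions of ours: `RunningMean` is a deliberately trivial,
hypothetical stand-in for the dynamic tree (it only tracks the mean response), inputs are
taken to be uniform on $[0, 1]^5$, and only the RMSE side of the assessment is shown.

```python
import numpy as np

def drift_coefficient(t, k):
    """a_t = 2 sin(2 pi k t / 1000) + 1, which ranges over [-1, 3]."""
    return 2.0 * np.sin(2.0 * np.pi * k * t / 1000.0) + 1.0

def drifting_friedman(n, t, k, rng):
    """n pairs from the Friedman surface with its first term scaled by a_t."""
    x = rng.uniform(0.0, 1.0, size=(n, 5))
    mean = (10.0 * drift_coefficient(t, k) * np.sin(np.pi * x[:, 0] * x[:, 1])
            + 20.0 * (x[:, 2] - 0.5) ** 2 + 10.0 * x[:, 3] + 5.0 * x[:, 4])
    return x, mean + rng.normal(0.0, 1.0, size=n)

class RunningMean:
    """Hypothetical stand-in for the DT: predicts the mean response seen so far."""
    def __init__(self):
        self.n, self.mean = 0, 0.0

    def predict(self, x):
        return np.full(len(x), self.mean)

    def update(self, x, y):
        for value in y:
            self.n += 1
            self.mean += (value - self.mean) / self.n

def one_step_ahead_rmse(model, n_steps, k, rng, batch=5):
    """At each timestep: draw 5 fresh points, score the current model, then update it."""
    squared_errors = []
    for t in range(1, n_steps + 1):
        x, y = drifting_friedman(batch, t, k, rng)
        squared_errors.append((y - model.predict(x)) ** 2)  # test before training
        model.update(x, y)
    return float(np.sqrt(np.mean(np.concatenate(squared_errors))))

score = one_step_ahead_rmse(RunningMean(), n_steps=200, k=1.0,
                            rng=np.random.default_rng(0))
```

Any model exposing `predict` and `update` could be dropped in place of `RunningMean`,
which is what makes this protocol convenient for comparing forgetting factors.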

In Figure 4, we plot the RMSE and predictive probability observed over 100
MC iterations for a sequence of $\lambda$ values between $\lambda = 0$ (discarding without
retiring) and $\lambda = 1$ (retirement via Bayesian conjugate updating). Reassuringly, a
U-shaped curve appears, indicating a trade-off between throwing away too much information at
one extreme ($\lambda = 0$), and retaining obsolete information at the other ($\lambda = 1$).
For rapidly changing data distributions ($k = 1$), a value of $\lambda = 0.8$ seems to
perform best. Repeating the experiment for slower-changing data distributions ($k = 0.1$)
produces performance that peaks at $\lambda = 0.97$ instead, confirming our intuition.

In Figure 5, we investigate the effect that discarding and forgetting have on model
complexity, as measured by average tree height over time. To do so, we again generate drifting
Friedman data, this time with $a_t = 10$ between $t = 10000$ and $t = 20000$ (denoted by
vertical lines in Figure 5), and a value of smaller magnitude otherwise, so that model
complexity rises sharply and then drops again. We deploy a DT without discarding (i.e.,
sequentially incorporating the full dataset), a DT with a fixed budget of 100 active
datapoints and no forgetting, and one with the same budget and mild forgetting
($\lambda = 0.9$). First observe that capping the active data pool size
significantly penalises model complexity on the whole. Also note that all three methods
react to the rise in complexity at $t = 10^4$ by favouring deeper trees. However, once the
data complexity drops again at $t = 2 \times 10^4$, both the full model and the online model
without forgetting ($\lambda = 1$) retain their average tree depth, failing to return to
earlier levels. By contrast, $\lambda = 0.9$ allows the model to adapt to the change, as the
priors are more easily outweighed by the impact of novel information.

Figure 5: Average tree height over time for the full model, and historical discarding with
$\lambda = 0.9$ or $\lambda = 1$. The true regression surface complexity rises in
$t \in [10^4, 2 \times 10^4]$.

5.2 Synthetic drifting classification data 

We now turn to streaming classification. We henceforth adopt the standard one-step-ahead
performance assessment paradigm, wherein the algorithm at time $t$ first predicts the class
label of the unlabelled $(t + 1)$th datapoint, and is then allowed to use both the datapoint
itself and its label to update its parameters.

We first consider a classification problem where the optimal decision boundary is always 
non-linear, but drifts in time in such a way so that older data become increasingly misleading 
for future predictions. This effect can be synthesised by rotating a 'fuzzy' XOR problem, 
displayed in the left plot of Figure 6. The XOR forces a non-linear decision boundary,
whereas the rotation implies that recent data should have priority over older data. This 
example, which we refer to as MOVINGTARGET, is an extreme one since in general drift 
could also manifest itself in ways that render past information useless, but not outright 
misleading. Even in such cases, data discarding and forgetting may be useful to 'free up' 
degrees of freedom, but the effect is unlikely to be as dramatic and would therefore be harder 
to measure. 
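A rotating 'fuzzy' XOR of this kind could be generated as in the sketch below. The
specifics are our assumptions, since the text does not give them: the input domain
$[-1, 1]^2$, a constant angular speed with one full turn per `period` timesteps, and
Gaussian jitter on the rotated coordinates as the source of fuzziness are all hypothetical
choices, as is the function name.

```python
import numpy as np

def rotating_xor(n, t, rng, period=1000.0, noise_sd=0.25):
    """Draw n labelled points from an XOR problem whose axes are rotated at time t."""
    angle = 2.0 * np.pi * t / period                  # assumed: constant angular speed
    x = rng.uniform(-1.0, 1.0, size=(n, 2))           # assumed input domain
    c, s = np.cos(angle), np.sin(angle)
    u = c * x[:, 0] + s * x[:, 1]                     # coordinates in the rotated frame
    v = -s * x[:, 0] + c * x[:, 1]
    jitter = rng.normal(0.0, noise_sd, size=(n, 2))   # assumed source of 'fuzziness'
    # XOR labelling: class 1 when both (jittered) rotated coordinates share a sign.
    label = ((u + jitter[:, 0]) * (v + jitter[:, 1]) > 0).astype(int)
    return x, label

# Same inputs, a quarter turn apart, without jitter: every label is reversed.
x_a, y_a = rotating_xor(200, 0.0, np.random.default_rng(1), noise_sd=0.0)
x_b, y_b = rotating_xor(200, 250.0, np.random.default_rng(1), noise_sd=0.0)
```

Under these assumptions, data observed a quarter of a period ago carry exactly the wrong
labels for the current boundary, which is the sense in which older data become outright
misleading rather than merely uninformative.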

In the right plot of Figure 6 and the rightmost column of Table 1, we illustrate the
effect that introducing a forgetting factor has on the performance of a DT with historical
discarding against MOVINGTARGET. We compare against two state-of-the-art methods,
Quadratic Discriminant Analysis with Adaptive Forgetting (QDA-AF) (Anagnostopoulos
et al., 2009), and Online Linear Discriminant Analysis with Adaptive Learning Rate
(OLDA-ALR) (Kuncheva and Plumpton, 2008), both designed with streaming classification
contexts in mind. In Table 1, performance is measured in three ways: correct classification
rate, Area Under the Curve, and the H-measure, a newly preferred alternative to the AUC
(Hand, 2009).

Figure 6: Left: four snapshots of the drifting synthetic classification data underlying this
simulation study. Right: average classification performance (Correct Classification Rate)
over time.



Table 1: Performance in terms of Area Under the Curve (AUC), H-measure (H) and Correct
Classification Rate (CCR) for each of three datasets (two real and one simulated), and three
instantiations of dynamic trees as well as two competitive methods.

                            ELEC2                  FRAUD                  MOVINGTARGET
                            AUC    H      CCR      AUC    H      CCR      AUC    H      CCR
OFFLINE ($n = 10^4$)        0.771  0.267  0.702    0.588  0.048  0.941    0.562  0.044  0.655
ONLINE ($\lambda = 1$)      0.761  0.274  0.724    0.724  0.155  0.971    0.528  0.011  0.656
ONLINE ($\lambda = 0.8$)    0.880  0.480  0.808    0.930  0.622  0.982    0.668  0.111  0.609
QDA-AF                      0.924  0.643  0.873    0.973  0.920  0.983    0.504  0.001  0.560
OLDA-ALR                    0.763  0.239  0.692    0.832  0.414  0.974    0.507  0.001  0.656
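Two of the three performance measures can be computed from one-step-ahead predicted class
probabilities as sketched below. This is a generic illustration, not the evaluation code
behind Table 1: the function names and the 0.5 decision threshold are our assumptions, the
AUC is computed through its pairwise-ranking interpretation, and the H-measure is omitted.

```python
import numpy as np

def ccr(y_true, scores, threshold=0.5):
    """Correct classification rate when predicting class 1 if score >= threshold."""
    y_true = np.asarray(y_true)
    predicted = (np.asarray(scores, dtype=float) >= threshold).astype(int)
    return float(np.mean(predicted == y_true))

def auc(y_true, scores):
    """Area under the ROC curve as the share of correctly ranked class pairs.

    Every (class 1, class 0) pair counts 1 if the class-1 score is higher and
    1/2 if the two scores tie. Both classes must be present in y_true.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]     # hypothetical predicted P(class = 1)
example_ccr = ccr(labels, probabilities)
example_auc = auc(labels, probabilities)
```

The pairwise form costs quadratic memory in the stream length, so for long streams a
rank-based computation would be preferable; it is used here only for transparency.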



In all three respects, data discarding hardly improves performance when $\lambda = 1$, whereas
for $\lambda = 0.8$ significant improvement is possible. For an explanation, consider the way
in which classification performance evolves over time, shown on the rightmost plot of
Figure 6: the detrimental effect of obsolete, misleading data becomes visually obvious for
the full-data model, as well as the online model with $\lambda = 1$. Interestingly, these
latter two methods, although otherwise distinct, share a similar performance bottleneck with
OLDA-ALR that DTs with forgetting decidedly overcome.

Now consider two real datasets, ELEC2 and FRAUD, which are known to exhibit concept
drift (Anagnostopoulos et al., 2009). The former holds information for the Australian
New South Wales Electricity Market and was introduced in Baena-Garcia et al. (2006),
comprising 27552 instances, each referring to a period of 30 minutes. The class label
identifies the price change related to a moving average of the last 24 hours, and the four
covariates capture aspects of electricity demand and supply. The latter dataset, FRAUD, is of
length prohibitive to many existing methods ($n = 100{,}000$), and contains information about
credit card transactions, and their respective status as legitimate or fraudulent, determined
by experts (see Anagnostopoulos et al. (2009) for more details). Results in terms of CCR over
time are presented in Figure 7. The full suite of numerical results is provided in Table 1.




Figure 7: Left: Electricity Market data. Right: Fraud data. Average classification
performance (Correct Classification Rate) over time.

QDA-AF, which was vastly inferior in the synthetic MOVINGTARGET example, dominates
in these examples (particularly for FRAUD). Comparisons between the remaining methods
paint an interesting picture. In both datasets, FULL is among the lowest performers,
suggesting that data discarding is essential to maintaining a representative fit against
drifting data distributions. Among DTs, $\lambda = 1$ performs poorly, as informative priors
accumulate irrelevant or misleading information. In contrast, $\lambda = 0.8$ is much better,
outperforming OLDA-ALR. Although the ranking of various classifiers will generally differ by
application, these experiments give a strong signal that the use of forgetting factors can
turn DTs into a promising, flexible tool for streaming classification.

6 Conclusion 

In this work, we strive to fully utilise the potential of Bayesian machinery in the context of
streaming non-parametrics. First, we propose data retirement via conjugate Bayesian updating
in the context of SMC inference for a dynamic tree model, preserving non-parametric
flexibility while enabling constant-memory online operation. Second, the availability of
tractable predictive distributions allows us to devise computationally efficient active
retirement heuristics, hence maintaining a fixed budget of highly informative datapoints.
Both features minimise the information loss incurred by single-pass processing. Finally, we
deploy informative power priors to enable temporal adaptivity. This results in a novel,
powerful algorithmic scheme for non-parametric regression and classification tailored to
massive and streaming data contexts. As future work, we intend to pursue techniques for
automatic tuning of forgetting factors in streaming contexts, and their interplay with active
retirement heuristics.